This directory contains three CSV files with metadata and demographic information about books and authors featured in Melanie Walsh and Maria Antoniak's essay "The Goodreads 'Classics': A Computational Study of Amazon, Readers, and Crowdsourced Amateur Criticism":

	- Goodreads-Classics-Metadata.csv
	- AP-Recommended-Authors.csv
	- Top-200-Open-Syllabus-Authors.csv

More information about this data, its collection, and its categorization can be found below.

*If you find any errors in this data or if you have any suggestions, please contact Melanie Walsh (melanie.walsh@cornell.edu).

------

Gender and Race

In each of the datasets, we hand-labeled authors by their gender and race in order to explore demographic representation among the Goodreads "classics," college English syllabi, and the College Board's recommended authors list for the Advanced Placement English Literature and Composition 2020 exam.

To categorize authors by gender, we used the labels "Man" and "Woman." We recognize that this binary coding is a reductive understanding of gender; however, these labels corresponded with the publicly available information that we could find about the authors.

Race is extremely complex and difficult to reduce to data, especially because racial categories differ across societies. In this project, we chose to use racial categories from the U.S. to reflect the perspective of the majority of Goodreads users: most of the site's users have historically hailed from the U.S., and U.S. users made up an estimated 40% of sitewide traffic in 2020, according to Quantcast.

To categorize authors by race, we used a slightly expanded version of the racial categories presented in the U.S. census: White, Black or African American, American Indian or Alaska Native, Asian, Native Hawaiian or Other Pacific Islander, Latinx, and Middle Eastern or North African (MENA). While the U.S. census currently treats Hispanic/Latino/Spanish origin as a question of ethnicity and not race, and it currently considers the MENA population as white, we include them as separate racial categories based on advocacy from groups such as the Arab American Institute and research from the U.S. Census Bureau that suggests that incorporating Latinx and MENA might lead to more reflective racial representation.

We recognize, however, that racial categories from the U.S. census, even in an expanded form, are flawed and subject to criticism. For more on Latinx and MENA as expanded racial categories, as well as the flaws and history of racial categories in the U.S. census, see Hephzibah V. Strmic-Pawl, Brandon A. Jackson, and Steve Garner, “Race Counts: Racial and Ethnic Data on the U.S. Census and the Implications for Tracking Inequality,” Sociology of Race and Ethnicity 4, no. 1 (January 1, 2018): 1–13. See also The United States Census Bureau, “About Race”; “2015 National Content Test: Race and Ethnicity Analysis Report.”

Note: If we could not find publicly available information about an author's race, we used the label "Unknown." In two cases in Top-200-Open-Syllabus-Authors.csv, the "author" is an organization and not an individual person, and in those cases we used the label "NA" for Not Applicable.
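The distinction between "Unknown" and "NA" is easy to lose when loading the data, so here is a minimal sketch of counting labels while keeping both as literal strings. The sample rows and the short column name "race" are hypothetical placeholders, not values from the real files; the actual header may include the parenthetical shown in the column lists below.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows for illustration only; NOT taken from the real file.
# The real race column header may be longer than "race" (an assumption here).
SAMPLE = """author,race
Example Author A,Unknown
Example Organization,NA
Example Author B,White
"""


def count_labels(csv_file, column):
    """Count each label in `column`, treating "Unknown" and "NA" as ordinary strings."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


race_counts = count_labels(io.StringIO(SAMPLE), "race")
print(race_counts)
```

To run this against a real file, pass an open file handle (for example, one opened with newline="" and encoding="utf-8") in place of the in-memory sample. If you load the files with pandas instead, note that read_csv treats the string "NA" as a missing value by default; passing keep_default_na=False keeps it as text.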

---

Goodreads-Classics-Metadata.csv

This file contains data about the 100 books most frequently shelved/tagged as a "classic" by Goodreads users and the 100 most-read books tagged as a "classic" (https://www.goodreads.com/genres/most_read/classics) in September 2019. This data was scraped from the Goodreads website in September 2019.

There is overlap between these two categories, so in total there are 144 books.
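The size of the overlap follows from the three counts above. As a small worked example, assuming each list contains exactly 100 books as stated (the function name is our own, for illustration):

```python
def overlap_size(size_a: int, size_b: int, size_union: int) -> int:
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return size_a + size_b - size_union


# 100 most-shelved books, 100 most-read books, 144 distinct books in total.
books_on_both_lists = overlap_size(100, 100, 144)
print(books_on_both_lists)
```

Under that assumption, the number of books that appear on only one list is the total minus this overlap.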

The columns in this dataset include:

	- author
	- title
	- yearFirstPublished
	- mostReadRank
		* Ranking among the "most read" classic books
	
	- mostPopularRank
		* Ranking among the most shelved classic books
	
	- numRatings
		* Total number of Goodreads ratings of the book (each rating is from 1 to 5 stars)
	
	- numReviews
		* Total number of Goodreads reviews of the book
	
	- averageRating
		* Average Goodreads rating of the book (across all ratings in Goodreads history)
	
	- whichListCategory
		* Whether the book was most shelved as a classic, most read in September 2019, or both
	
	- numPages
		* Number of pages in the book
	
	- original_language
		* Original language of the book
	
	- gender
		* Author gender
	
	- race (from perspective of U.S. racial logics)
		* Author race
	
	- author_nationality
		* Author nationality
	
	- ap_english
		* Whether the author was recommended by the Advanced Placement Program, English Literature & Composition, for the 2020 exam
	
	- college_syllabi
		* Whether the author was among the top 200 authors listed on college-level English syllabi (according to the Open Syllabus project in 2020)
	
	- isbn13
		* The 13-digit ISBN of the book
	
	- book_id_title
		* The book's unique Goodreads ID and title, which is used at the end of Goodreads URLs (e.g., 13079982-fahrenheit-451 -> https://www.goodreads.com/book/show/13079982-fahrenheit-451)
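As an illustration of the book_id_title column, the sketch below rebuilds a Goodreads URL from a value using the pattern in the example above. The helper names are hypothetical, and extracting the numeric ID from the leading digits is an assumption about how the values are formatted.

```python
import re

# URL prefix taken from the example in the column description above.
GOODREADS_BOOK_URL_PREFIX = "https://www.goodreads.com/book/show/"


def goodreads_url(book_id_title: str) -> str:
    """Build the Goodreads book URL for a book_id_title value."""
    return GOODREADS_BOOK_URL_PREFIX + book_id_title.strip()


def goodreads_id(book_id_title: str) -> str:
    """Return the leading run of digits, assumed to be the Goodreads book ID."""
    match = re.match(r"\d+", book_id_title.strip())
    if match is None:
        raise ValueError(f"no leading numeric ID in {book_id_title!r}")
    return match.group(0)


print(goodreads_url("13079982-fahrenheit-451"))
print(goodreads_id("13079982-fahrenheit-451"))
```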

---

AP-Recommended-Authors.csv

This file contains data about authors who were recommended by the College Board for the Advanced Placement English Literature & Composition 2020 exam. We compiled this data from The Princeton Review’s Cracking the AP English Literature & Composition Exam, 2020 Edition: Practice Tests & Prep for the NEW 2020 Exam (New York: The Princeton Review, 2019), https://www.amazon.com/Cracking-English-Literature-Composition-Exam/dp/0525568239.

The columns in this dataset include:

	- author
	- genre
		* The genre of the author as categorized by the College Board and The Princeton Review’s Cracking the AP English Literature & Composition Exam

	- gender
	- race (from perspective of U.S. racial logics)
	- author_nationality
	- Top Goodreads Classics Author
		* If the author appeared among the 144 most shelved and/or most read "classic" works as of September 2019


---

Top-200-Open-Syllabus-Authors.csv

This file contains data about the 200 authors who appeared on the most college-level English Literature syllabi in the Open Syllabus Project (https://opensyllabus.org/) as of 2020. The data was scraped from the website here: https://opensyllabus.org/results-list/authors?size=200&fields=English%20Literature

The columns in this dataset include:

	- rank
		* The ranking of the author in terms of how many times they appear on college-level English literature syllabi

	- author
	- appearances
		* The number of times the author appeared on college-level English literature syllabi

	- gender
	- race (from perspective of U.S. racial logics)
	- author_nationality
	- Top Goodreads Classics Author
		* If the author appeared among the 144 most shelved and/or most read "classic" works as of September 2019
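Because all three files share an author column, they can be cross-referenced by author name. The sketch below is one possible way to find authors present in two of the datasets; it is not how the "Top Goodreads Classics Author" column was produced. All rows are hypothetical placeholders, and it assumes names are spelled identically across files, which may not hold for the real data.

```python
import csv
import io

# Hypothetical stand-ins for two of the files; rows are illustrative only.
GOODREADS_SAMPLE = """author,title
Example Author A,Example Title One
Example Author A,Example Title Two
Example Author B,Example Title Three
"""

SYLLABUS_SAMPLE = """rank,author,appearances
1,Example Author B,10
2,Example Author C,5
"""


def author_set(csv_file, column="author"):
    """Return the set of distinct, whitespace-trimmed author names in a CSV."""
    return {row[column].strip() for row in csv.DictReader(csv_file)}


goodreads_authors = author_set(io.StringIO(GOODREADS_SAMPLE))
syllabus_authors = author_set(io.StringIO(SYLLABUS_SAMPLE))

# Authors appearing in both datasets (exact string match only).
shared_authors = sorted(goodreads_authors & syllabus_authors)
print(shared_authors)
```

Exact matching will miss variants such as initials or alternate transliterations, so results from a comparison like this should be checked by hand.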


